Abstract

Due to their great flexibility, nonparametric Bayes methods have proven to be a valuable tool 
for discovering complicated patterns in data. The term "nonparametric Bayes" suggests that 
these methods inherit model-free operating characteristics of classical nonparametric methods, 
as well as coherent uncertainty assessments provided by Bayesian procedures. However, as the 
authors say in the conclusion to their article, nonparametric Bayesian methods may be more 
aptly described as "massively parametric." Furthermore, I argue that many of the default
nonparametric Bayes procedures are only Bayesian in the weakest sense of the term, and cannot be
assumed to provide honest assessments of uncertainty merely because they carry the Bayesian
label. However useful such procedures may be, we should be cautious about advertising default
nonparametric Bayes procedures as either being "assumption free" or providing descriptions of
our uncertainty. If we want our nonparametric Bayes procedures to have a Bayesian interpretation,
we should modify default NP Bayes methods to accommodate real prior information,
or at the very least, carefully evaluate the effects of hyperparameters on posterior quantities of 
interest. 
1 Parametric and nonparametric approaches

Historically, a standard justification of Bayesian methods has been that they provide an internally
consistent approach to updating information: If $\mathcal{P}_\Theta = \{p(y \mid \theta) : \theta \in \Theta\}$ expresses our beliefs about
$Y$ given $\theta$, and $\pi(\theta)$ expresses our beliefs about $\theta$, then $\pi(\theta \mid y) \propto \pi(\theta) p(y \mid \theta)$ expresses what we
should believe about $\theta$, having observed $Y = y$. From this subjective Bayesian point of view,
for $\pi(\theta \mid y)$ to be of most use, both $p(y \mid \theta)$ and $\pi(\theta)$ should actually represent our beliefs, at least
approximately. A criticism of parametric Bayesian methods is that commonly used models $\mathcal{P}_\Theta$
are often suspected of being wrong. Nonparametric Bayes methods appear to solve this problem
by making $\mathcal{P}_\Theta$ so large that it includes essentially all relevant sampling distributions $p(y \mid \theta)$. The
authors seem to suggest that NP Bayes methods therefore provide an "honest representation of
uncertainties". I would agree with this, to the extent that $\pi(\theta)$ actually represents prior beliefs.

How honest are parametric and nonparametric priors? Parametric priors are arguably inaccu- 
rate as they assign probability one to a simple parametric model. However, the great advantage of 
parametric Bayesian approaches is that they allow the prior to be specified in terms of parameters 
of interest, which often happen to be the parameters about which we have real prior information. 
As a very simple example, suppose we have a sample $y_1, \ldots, y_n$ of independent observations from
a population for which we have prior information about the mean and variance. In this case, a
limited form of subjective, robust Bayesian inference for the population mean $\theta$ can proceed via
the posterior density obtained from a normal sampling model for $y_1, \ldots, y_n$. While the likelihood
may not be exactly correct, the resulting inferences for $\theta$ are robust to nonnormality, asymptotically
correct and provide confidence intervals for which the asymptotic frequentist coverage equals
the asymptotic Bayesian coverage. Perhaps most importantly, the inference is transparent: the
approximate normality of $\bar{y}$ is well understood, and the effect of the prior on the parameter is
simple, especially if a conjugate prior is used. Even if the prior does not represent our actual prior 
information, at least we can understand what information it represents. 
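
The transparency of the conjugate normal analysis can be illustrated with a minimal sketch. The function name, the data values and the prior settings below are hypothetical, and the sample variance is plugged in for the sampling variance as a simplifying assumption.

```python
import math


def normal_posterior_for_mean(y, prior_mean, prior_var, sigma2=None):
    """Posterior mean and variance of theta under a normal working model.

    Assumes y_i ~ N(theta, sigma2) and theta ~ N(prior_mean, prior_var).
    If sigma2 is None, the sample variance is plugged in (an approximation).
    """
    n = len(y)
    ybar = sum(y) / n
    if sigma2 is None:
        sigma2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    prior_prec = 1.0 / prior_var   # precision contributed by the prior
    data_prec = n / sigma2         # precision contributed by the sample mean
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    return post_mean, post_var


# Hypothetical data and prior, for illustration only.
y = [4.1, 5.3, 6.0, 4.8, 5.6, 5.2]
post_mean, post_var = normal_posterior_for_mean(y, prior_mean=4.0, prior_var=1.0)
half_width = 1.96 * math.sqrt(post_var)
print(post_mean, (post_mean - half_width, post_mean + half_width))
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so the influence of the prior can be read directly from the two precisions.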

In contrast, I think it is safe to say that for most NP Bayes methods used in practice, the 
prior does not represent actual prior beliefs, even approximately. One difficulty is that standard 
NP Bayes priors include hyperparameters that directly control things that we are unlikely to have 
prior information about (the number of modes of a density) and only indirectly control things we 
might have information about (means, variances and correlations). For example, the choice of the 
hyperparameters in the prior for a Dirichlet process mixture model (DPMM) induces a prior on the 
mean and variance of the population, but the mapping from the hyperparameters to these induced
priors can be very opaque (Yamato, 1984; Lijoi and Regazzini, 2004). Similarly, the Polya tree
priors discussed in Section 2.2 require the specification of a partition over the sample space, the
choice of which will generally affect the posterior. The "solution" to this is the addition of a prior 
over the set of possible partitions. It is hard to imagine that such a prior represents actual prior 
information about the underlying population. 

Do such complications warrant abandoning NP Bayes methods and using simpler parametric
approaches? It may depend on the data analysis objectives. NP Bayes methods provide a flexible 
means of representing high-dimensional data structure. Overfitting is avoided by a regularizer (the 
prior) that has a probabilistic interpretation. These features make NP Bayes methods an attractive 
set of tools for such tasks as prediction and clustering. However, if the data analysis objective is to 
describe our posterior information about a parameter of interest, then the appropriateness of NP
Bayes is less clear. Example 1 from Müller and Mitra (2013) is a situation where the use of an NP
Bayes method may be obfuscating the sources of information about the parameter of interest, $F(0)$.
If we are to take the likelihood at face value, then it seems the data have little to say about the
value of $F(0)$: Writing $f_k = F(k)$ and letting $n_k$ be the number of T-cell sequences for which $k$
replicates were observed, the likelihood can be expressed as $p(y \mid f_0, \ldots, f_4) = \prod_{k=1}^{4} (f_k/[1 - f_0])^{n_k}$.
How much information do the data provide about $f_0$? One way to evaluate this is to consider
the profile likelihood function of $f_0$. For every fixed value of $f_0$ the likelihood $p(y \mid f_0, \ldots, f_4) =
\prod_{k=1}^{4} (f_k/[1 - f_0])^{n_k}$ is maximized in $f_1, \ldots, f_4$ at $f_k = (1 - f_0) n_k/n$, where $n = n_1 + \cdots + n_4$. This gives a constant profile
likelihood function for $f_0$, equal to $\prod_{k=1}^{4} (n_k/n)^{n_k}$ for every value of $f_0$. From a Bayesian perspective,
the posterior distribution of $f_0$ can be expressed as $\pi(f_0 \mid y) \propto \pi(f_0) p(y \mid f_0)$, where

$$p(y \mid f_0) = \int p(y \mid f_0, \ldots, f_4)\, \pi(f_1, \ldots, f_4 \mid f_0)\, df_1 \cdots df_4 .$$

The profile likelihood argument suggests that $p(y \mid f_0)$ will be fairly flat as a function of $f_0$, especially
if the prior over $f_1, \ldots, f_4$ is "diffuse," in which case $\pi(f_0 \mid y) \approx \pi(f_0)$. In this case, an "honest
assessment" of posterior uncertainty about $f_0$ requires only an honest prior for $f_0$. Alternatively, if
we believe there to be a relationship among $f_0, \ldots, f_4$ beyond the fact that $f_0 + \cdots + f_4 \le 1$, then
again an honest assessment of posterior uncertainty about $f_0$ for these data requires only an honest
specification of one's joint beliefs about the five numbers $f_0, \ldots, f_4$. In either case, the DPMM
over $\{f_0, f_1, \ldots\}$ seems like a very indirect and opaque way to specify a prior over the relevant
parameters. 
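
The flatness of the profile likelihood can be checked numerically. The sketch below uses illustrative counts for the numbers of sequences observed with one to four replicates; these counts are assumptions and are not taken from the article under discussion, and the function names are likewise hypothetical.

```python
import math


def log_lik(f0, f_rest, counts):
    """Zero-truncated log-likelihood: sum over k of n_k * log(f_k / (1 - f0))."""
    return sum(n_k * math.log(f_k / (1.0 - f0)) for f_k, n_k in zip(f_rest, counts))


def profile_log_lik(f0, counts):
    """Log-likelihood maximized over f_1..f_4 for a fixed value of f0."""
    n = sum(counts)
    # Maximizer for fixed f0: f_k = (1 - f0) * n_k / n.
    f_hat = [(1.0 - f0) * n_k / n for n_k in counts]
    return log_lik(f0, f_hat, counts)


# Illustrative counts n_1..n_4 (assumed values, not the article's data).
counts = [30, 12, 6, 2]
grid = (0.05, 0.3, 0.6, 0.9)
values = [profile_log_lik(f0, counts) for f0 in grid]
for f0, value in zip(grid, values):
    print(f0, value)
```

Up to floating-point rounding, the printed value is the same for every point on the grid, so the likelihood alone does not distinguish between small and large values of the zero-class probability.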

One could argue that the DPMM in this example is helping us estimate other aspects of the 
unknown frequencies, such as the relative frequencies, perhaps. However, even though the DPMM 
in this example is billed as a "nonparametric Bayes" procedure, it is not without strong model- 
ing assumptions. Specifically, this DPMM assumes the true distribution is a mixture of Poisson 
distributions - a class of distributions that does not contain all discrete distributions. 

2 Partial remedies for particular situations 

Alternative likelihoods: Consider a model $\{p(y \mid f) : f \in \mathcal{F}\}$ where $f$ is a high dimensional
parameter (such as a regression function or density). In situations where primary interest is in a
low-dimensional parameter $\theta = \theta(f)$, the difficulties of infinite-dimensional prior specification can
sometimes be avoided by using a likelihood that involves only $\theta$. For example, in many problems
there exists a statistic $t(y)$ whose distribution depends only on $\theta$, and not the high-dimensional
parameter $f$. In such cases, the likelihood can be expressed as

$$p(y \mid f) = p(t(y) \mid \theta) \times p(y \mid t(y), \theta, f).$$



The need to specify a prior over $f$ can be avoided by constructing a posterior distribution for $\theta$
based only on the marginal likelihood $p(t(y) \mid \theta)$, i.e. $\pi(\theta \mid t(y)) \propto \pi(\theta) p(t(y) \mid \theta)$. Estimates based on
such a posterior distribution could be inefficient, as they ignore any additional information about $\theta$
in $p(y \mid t(y), \theta, f)$, but they do not require specification of a prior for the high-dimensional nuisance
parameter $f$. A concrete example of such a procedure is given in Hoff (2007) in the context of a
semiparametric copula model, in which $\theta$ represents the parameters in a parametric dependence
model and $f$ parameterizes a set of unknown infinite-dimensional univariate marginal distributions.
Many researchers have considered other alternative likelihoods for robust or "nonparametric"
Bayesian inference (Efron (1993); Lazar (2003); Greco et al. (2008), to name a few). Asymptotically
correct likelihoods for parameters of interest can even be derived from misspecified models: Very
generally, the MLE $\hat\theta$ in a misspecified model is asymptotically normal,
so that

$$\sqrt{n}\,(\hat\theta - \theta^*) \sim N(0, V(\theta^*)),$$

where $\theta^*$ is the "pseudotrue" parameter and $V(\theta^*)$ is the "sandwich" variance (Huber, 1967). In
many cases (such as in exponential family models) $\theta^*$ is a population moment, and possibly the
parameter of interest about which we may have prior information. In this case, nonparametric
Bayesian inference can be obtained by combining a prior on $\theta^*$ with the asymptotic normal distribution
of $\hat\theta$ as a likelihood. Such "Bayesian sandwich" procedures have been considered by Szpiro et al.,
Müller (2012) and Hoff and Wakefield.
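
As a rough illustration of the idea, and not the procedure of any of the papers cited above, the sketch below uses a Poisson working model. For this model the MLE is the sample mean, the pseudotrue parameter is the population mean, and the sandwich variance reduces to the population variance. A normal prior on the pseudotrue parameter is combined with the normal approximation to the distribution of the MLE, with the sample variance plugged in. The data, the prior values and the function name are hypothetical.

```python
def sandwich_posterior(y, prior_mean, prior_var):
    """Normal-prior, normal-approximate-likelihood update for the pseudotrue mean.

    Working model: Poisson(theta). The asymptotic normal distribution of the
    MLE is used as the likelihood, with the sandwich variance estimated by
    the sample variance.
    """
    n = len(y)
    theta_hat = sum(y) / n                                    # Poisson MLE
    s2 = sum((v - theta_hat) ** 2 for v in y) / (n - 1)       # sandwich variance estimate
    sandwich_var = s2 / n                                     # robust variance of the MLE
    model_var = theta_hat / n                                 # variance if Poisson were true
    post_var = 1.0 / (1.0 / prior_var + 1.0 / sandwich_var)
    post_mean = post_var * (prior_mean / prior_var + theta_hat / sandwich_var)
    return {
        "post_mean": post_mean,
        "post_var": post_var,
        "sandwich_var": sandwich_var,
        "model_var": model_var,
    }


# Hypothetical overdispersed counts: the Poisson model understates the variance.
y = [0, 0, 1, 1, 2, 8, 9, 11]
result = sandwich_posterior(y, prior_mean=3.0, prior_var=4.0)
print(result)
```

For these overdispersed counts the sandwich variance of the MLE is several times the model-based variance, so the resulting posterior is appropriately wider than one that takes the Poisson likelihood at face value.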



Marginally specified priors: Such reductions of the parameter space are not feasible in applications
such as prediction, where the high- or infinite-dimensional parameter $f$ is of primary interest.
In such cases it is important from a Bayesian perspective that the prior for $f$ reflects known
information as much as possible. Realistically, a statistician is unlikely to have informed opinions
about all aspects of a high-dimensional parameter $f$, but may have real information about a
finite-dimensional functional $\theta = \theta(f)$, such as the population mean or variance. Recently, Kessler et al.
(2012) have proposed an approach for incorporating prior information about $\theta$ into a default NP
Bayes prior for $f$. Specifically, let $\pi_0$ be a prior for $f$ that is chosen arbitrarily or for computational
convenience. This prior induces a marginal prior on $\theta$, say $P_0$, that may not reflect actual prior
information, as quantified by a distribution $P_1$. To remedy this problem, first express the default
prior $\pi_0$ as

$$\pi_0(f \in A) = \int \pi_0(f \in A \mid \theta)\, P_0(d\theta).$$

To obtain a prior $\pi_1$ over $f$ with the desired marginal distribution $P_1$, simply replace $P_0$ with $P_1$
in the above expression. The resulting prior on $f$ then becomes

$$\pi_1(f \in A) = \int \pi_0(f \in A \mid \theta)\, P_1(d\theta).$$

Such a prior generally retains the good large-support properties of the default NP Bayes prior $\pi_0$,
but has an induced prior over $\theta$ that matches the actual prior information $P_1$. Computation of
the posterior under such a prior is also generally available if an MCMC algorithm exists for the
default prior $\pi_0$. In this case, posterior approximation under $\pi_1$ can be made with the addition of
a Metropolis-Hastings step.
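
When the default and desired marginal priors have densities $p_0$ and $p_1$, with $p_0$ positive wherever $p_1$ is, replacing one by the other amounts to reweighting the default prior by the ratio $p_1(\theta(f))/p_0(\theta(f))$. The sketch below illustrates this with self-normalized importance weights rather than a Metropolis-Hastings step, in a toy two-dimensional stand-in for $f$ for which $p_0$ is known in closed form. The stand-in model, the choice of desired marginal and all names are assumptions made for illustration; this is not the algorithm of Kessler et al.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def theta_of(f):
    """Functional of interest in the toy model: theta(f) = f[0] + f[1]."""
    return f[0] + f[1]


def draw_from_default_prior(rng):
    """Toy default prior: two independent N(0, 1) coordinates, so theta ~ N(0, 2)."""
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


def reweighted_prior_draws(n_draws, p1_mean, p1_var, seed=0):
    """Draws from the default prior with weights p1(theta) / p0(theta), normalized."""
    rng = random.Random(seed)
    draws = [draw_from_default_prior(rng) for _ in range(n_draws)]
    raw = [
        normal_pdf(theta_of(f), p1_mean, p1_var) / normal_pdf(theta_of(f), 0.0, 2.0)
        for f in draws
    ]
    total = sum(raw)
    return draws, [w / total for w in raw]


# Desired marginal prior on theta (assumed for illustration): N(1, 0.25).
draws, weights = reweighted_prior_draws(20000, p1_mean=1.0, p1_var=0.25)
theta_mean = sum(w * theta_of(f) for f, w in zip(draws, weights))
# The conditional prior given theta is untouched: f[0] - f[1] keeps variance 2.
diff_var = sum(w * (f[0] - f[1]) ** 2 for f, w in zip(draws, weights))
print(theta_mean, diff_var)
```

The weighted draws have a marginal for the functional centered near the desired value, while the aspect of $f$ that the prior information does not address keeps the distribution it had under the default prior.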

Noninformative priors: In the absence of prior information, NP Bayes practitioners may attempt
to produce "diffuse" priors by adjusting the hyperparameters in some way. However, intuition
about what parameters correspond to "noninformativeness" can be misleading, partly due to the
terminology used in the NP Bayes literature. For example, NP Bayes researchers should make
clear that the total mass parameter $\alpha$ in a DPMM controls much more than "the uncertainty of"
the mixing measure. As DPMM researchers know, this hyperparameter controls such things as the
entropy of the resulting probability density, the number of modes, etc. As another example, practitioners
sometimes select overdispersed base measures in DPMMs in the hope that these reduce the
effect of the prior on the analysis. Bush et al. (2010) have shown that such attempts generally lead
to an unreasonably small number of mixture components, and propose a more nuanced version of
the total mass hyperparameter to achieve a type of "noninformative" NP Bayes analysis.
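
One concrete way to see how strongly the total mass parameter shapes the prior is through the prior expected number of distinct clusters among $n$ observations under the Polya urn scheme, $\sum_{i=1}^{n} \alpha/(\alpha + i - 1)$. The sketch below evaluates this standard formula; the function name and the grid of values are chosen for illustration, and the number of clusters is not the same thing as the number of modes of the resulting density.

```python
def expected_num_clusters(alpha, n):
    """Prior expected number of distinct clusters among n draws from a
    Dirichlet process with total mass alpha: sum_{i=1}^n alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i) for i in range(n))


# Illustrative grid of total mass values for a sample of size 100.
for alpha in (0.1, 1.0, 10.0):
    print(alpha, expected_num_clusters(alpha, 100))
```

Even with the sample size held fixed, moving the total mass parameter across this grid changes the prior expected number of clusters from close to one to more than twenty, which is a substantive prior statement rather than a neutral setting.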

3 Conclusion 

Standard data analysis procedures can generally be described as techniques that convert a large set 
of numbers (the data) into a smaller set of numbers (parameter estimates, standard errors, etc.). 
Ideally, such a procedure is statistically meaningful and reasonably transparent: meaningful in that 
it has some desirable property in an idealized situation, and transparent so that its behavior can 
be understood outside of the idealized situation. Conjugate Bayesian estimation in an exponential 
family model is a good example of a meaningful and transparent procedure: The properties of such 
procedures are well-understood in the subjective Bayes framework, in an asymptotic framework 
and even in settings where the model is misspecified. In many cases, incorrect parametric models
can provide meaningful, transparent and accurate inference for certain population parameters of 
interest, if not for all aspects of the population. 

NP Bayes procedures typically convert a small set of numbers (the data) into a much larger set 
of numbers (the posterior distribution of the infinite dimensional parameter). What meaning can 
such extrapolative procedures have? Asymptotic results assure us that many NP Bayes procedures 
converge to the truth as fast as other nonparametric procedures, and perhaps faster if the prior
is close to the truth in a topological sense (see, for example, Ghosal (2001)). It might even be the
case that 95% posterior confidence intervals have approximate 95% frequentist coverage, as they
often do in parametric models (Severini, 1991).



How are NP Bayes procedures justified non-asymptotically? Small sample justifications of 
Bayesian procedures are often based on their optimality under a particular prior. In simple models, 
this justification is transparent, in that even if the prior doesn't represent one's actual beliefs, at 
least one understands what beliefs it represents. Prior specifications for NP Bayes procedures are
more opaque and harder to justify from a Bayesian perspective. Default prior distributions are not 
generally going to represent prior beliefs, making it difficult to interpret the corresponding posterior 
distributions as posterior beliefs (except perhaps asymptotically). In terms of transparency, it is 
certainly possible to gain a strong intuition for the effects of hyperparameters on posterior output. 
Yet expert nonparametric Bayesians frequently use terminology that may be misleading to less 
experienced NP Bayes practitioners: For example, referring to the mass parameter $\alpha$ in a DPMM
as indexing uncertainty is an incomplete description at best. Referring to a DPMM as a method 
for "BNP inference" on a clustering overlooks, as the authors pointed out in Section 3.1, that the 
Polya urn scheme is a very particular one-parameter partition model. Referring to a mixture of 
Poisson distributions as "nonparametric" may give the impression to the inexperienced reader that 
the resulting mixture model contains all discrete distributions. 

Most importantly, a posterior distribution does not provide an honest assessment of uncertainty 
by virtue of being a posterior distribution. Such an assessment is obtained via either an honest 
prior or asymptotically. In the absence of an infinite sample size, considerable effort should be 
made to use a prior distribution that approximates as closely as possible any real prior information 
that is available. In the absence of prior information, a more complete (but tedious) description of 
uncertainty would include a sensitivity analysis over possible values of the hyperparameters. 

I am certainly not arguing that such efforts regarding the prior be mandated for every application 
of an NP Bayes method. However, I feel that more effort in this direction is necessary if we want 
our posterior distributions to represent honest assessments of uncertainty. 